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ABSTRACT 

Summary: We present a new tool to correct sequencing errors in 
lllumina data produced from high-coverage whole-genome shotgun 
resequencing. It uses a non-greedy algorithm and shows comparable 
performance and higher accuracy in an evaluation on real human 
data. This evaluation has the most complete collection of high- 
performance error correctors so far. 

Availability and implementation: https://github.com/lh3/bfc 
Contact: hengli@broadinstitute.org 


1 INTRODUCTION 

Error correction is a process to fix sequencing errors on a sequence 
read by using other overlapping reads that do not contain the errors. 
Many de novo assemblers, in particular short-read assemblers for 
large genomes, use error correction to reduce the complexity of the 
assembly graph such that the graph can be fitted to limited RAM. 
Error correction was first ex pressed as the spectrum alignment 
problem jPevzner et al 1 20m]), whereby we take a set of trusted k- 
mers and attempt to find a sequence with minimal corrections such 
that each k -mer on the corrected sequence is trusted. The majority 
of error correctors are based on this idea and take a greedy approach 
to solving this problem. They make a correction based on the local 
sequence context and do not revert the decision. They may not 
find the sequence with the minimal corrections. We worried that 
the greedy strategy might affect the accuracy given reads from a 
repeat-rich diploid genome, so derived a new algorithm. It is optimal 
provided that we know there is an error-free /c-mer in the read. 


2 METHODS 

Algorithm 1 is the key component of BFC. It defines a state of correction as 
a 4-tuple (z, W, C, p), which consists of the position z of the preceding base, 
the last (fc-l)-mer W ending at z, the set C of previous corrected positions 
and bases (called a solution ) up to i, and the penalty p of solution C. BFC 
keeps all possible states in a priority queue Q. At each iteration, it retrieves 
the state (z, W, C, p) with the lowest penalty p (line 1) and adds a new state 
(z + 1, W[l, k — 2] o a, C,p') if a is the read base or W o a is a trusted 
k- mer. If the first /c-mer in S is error free and we disallow untrusted k -mers 
by removing line 3, this algorithm finds the optimal solution to the spectrum 
alignment problem. 

It is possible to modify the algorithm to correct insertion and deletion 
errors (INDELs) by augmenting the set of the “next bases” at line 2 to: 

A/” = {(j, a)\j £{i-l,i},a£ {A, C, G, T}} U {(i, e)} 


Algorithm 1: Error correction for one string in one direction 
Input: K-mer size k, set ‘H of trusted fc-mers, and one string S 
Output: Set of corrected positions and bases changed to 

Function CorrectErrors(A:,H, S) begin 

Q <— HEAPlNIT() > Q is a priority queue 

HeapPush(Q, (k — 2, S'fO, k — 2], 0, 0)) > 0-based strings 

while Q is not empty do 

1 (z , VF, C, p) <— HeapPopBest(Q) > current best state 

i <— i + 1 

if i = \S\ then return C > reaching the end of S 

2 J\f 4— {(z, A), (z, C), (z, G), (z, T)} > set of next bases 

foreach (j, a) E M do > try all possible next bases 

W' <— W o a t> “o” concatenates strings 

if i — j and a = S'[7] then > no correction 

if W' E 71 then [> good read base; no penalty 

| HeapPush(Q,(j, W'[l,k - l],C,p)) 

else t> bad read base; penalize 

3 |_ HEAPPuSH(Q,(j, W'[l,k - l],C,p+ 1)) 

else if W' E 7i then > make a correction with penalty 

4 [_ HEAPPUSH(Q,(j, l],CU{(j, a)},p+l)) 


In this set, (i, a) substitutes a base at position i, ( i , e) deletes the base and 
(z — 1, a) inserts a base a before i. We have not implemented this INDEL- 
aware algorithm because such errors are rare in lllumina data. 

The worse-case time complexity of Algorithm 1 is exponential in the 
length of the read. In implementation, we use a heuristic to reduce the search 
space by skipping line 4 if the base quality is 20 or higher (Q20) and the 
/c-mer ending at it is trusted, or if five bases or two Q20 bases have been 
corrected in the last lObp window. If BFC still takes too many iterations 
before finding an optimal solution, it stops the search and marks the read 
uncorrectable. 

Given a read, BFC finds the longest substring on which each k -mer 
is trusted. It then extends the substring to both ends of the read with 
Algorithm 1. If a read does not contain any trusted fc-mers, BFC exhaustively 
enumerates all /c-mers one-mismatch away from the first k -mer on the read 
to find a trusted /c-mer. It marks the read uncorrectable if none or multiple 
trusted /c-mers are found this way. 

We provided two related imp lementations of Algorith m 1, BFC-bf and 
BFC-ht. BFC-bf uses KMC2 toeorowicz et all l2Q15h to get exact k- 
mers counts and then keeps /c-mers occu rring three times or more in 
a blocked bloom filter JPutze et all 120071) . BFC-ht uses a combination 
of bloo m filter and in-memo r y hash table to derive approximate k -mer 
counts iMelsted and Pritchard . l20llh and counts of /c-mers consisting of 
Q20 bases. We modified Algorithm 1 such that missing trusted high-quality 
/c-mers incurs an extra penalty. This supposedly helps to correct systematic 
sequencing errors which are recurrent but have lower base quality. 
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Table 1. Performance of error correction 


Prog. 

k 

Time 

RAM 

Perfect 

Chim. 

Better 

Worse 

raw data 

_ 

_ 

_ 

2.40M 

12.4k 

_ 

_ 

BBMap 

31 

3h22m 

33.0G 

2.78M 

12.4k 

505k 

19.2k 

BFC-ht 

31 

7h 15m 

83.5G 

3.03M 

13.6k 

816k 

10 . 8 k 

BFC-ht 

55 

5h51m 

67.9G 

3.05M 

11.7k 

830k 

9.0k 

BFC-bf 

31 

7h32m 

23.3G 

3.01M 

13.1k 

783k 

9.2k 

BFC-bf 

55 

4h41m 

23.3G 

3.05M 

11 . 8 k 

819k 

11.4k 

BLESS 

31 

6h31m 

22.3G 

2.91M 

13.1k 

674k 

20 . 8 k 

BLESS 

55 

5h09m 

22.3G 

3.01M 

11.5k 

775k 

10.3k 

Bloocoo 

31 

5h52m 

4.0G 

2.88M 

14.1k 

764k 

31.5k 

Fermi2 

29 

17hl4m 

64.7G 

3.00M 

17.7k 

849k 

42.8k 

Lighter 

31 

5h 12m 

13.4G 

2.98M 

13.0k 

756k 

30.1k 

Musket 

27 

21h33m 

77.5G 

2.94M 

22.5k 

790k 

36.3k 

SGA 

55 

48h40m 

35.6G 

3.01M 

12 . 1 k 

755k 

12 . 8 k 


4.45 million pairs of ~150bp reads were downloaded from BaseSpace, under the 
sample “NA12878-L7” of project “HiSeq X Ten: TruSeq Nano (4 replicates of 
NA12878)”, and were corrected together. On a subset of two million randomly 
sampled read pairs, the original and the correcte d seque nces were mapped to hs37d5 
(http://bit.ly/GRCh37d5 i with BWA-MEM |lJ |2013|) . A read is said to become 
better (or worse) if the best alignment of the corrected sequence has more (or 
fewer) identical bases to the reference genome than the best alignment of the original 
sequence. The table gives /c-mer size (maximal size used for Bloocoo, fermi2, Lighter 
and Musket), the wall-clock time when 16 threads are specified if possible, the peak 
RAM measured by GNU time, number of corrected reads mapped perfectly, number 
of chimeric reads, number of corrected reads becoming better and the number of reads 
becoming worse than the original reads. For each metric, the best tool is highlighted 
in the bold fontface. 


potentially demonstrates that a non-greedy algorithm might work 
better, though subtle differences in heuristics and hidden thresholds 
between the tools could also play a role. We should note that it is 
possible to tune the balance between accuracy, speed and memory 
for each tool. We have not fully explored all the options. 

In the table, error correctors appear to be faster and more accurate 
when longer fc-mers are in use. A possible explanation is that longer 
fc-mers resolve more repeat sequences and also reduce the search 
space. However, when we use BFC-ht to correct errors in this 
dataset, fermi2 O l2012h derived longer contigs and better variant 
calls with shorter fc-mers. We speculate that this observation is 
caused by reduced fc-mer coverage firstly because there are fewer 
long fc-mers on each read and secondly because longer fc-mers are 
more likely to harbor errors. The reduced fc-rner coverage makes 
it harder to correct errors in regions with low coverage and thus 
increases the chance of breaking contigs. To take advantage of both 
shorter and longer fc-mers, we have also tried a two-round correction 
strategy with two fc-mer sizes. The strategy leads to better numbers 
in the table (861k reads corrected to be better and 9.5k worse) at the 
cost of speed, but does not greatly improve the assembly. We will 
focus on understanding the interaction between error correctors and 
assemblers in future works. 
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server with 20 cores of Intel E5-2660 CPUs and 128GB RAM. 
Precompiled binaries are available through http://bit.ly/biobin and 
the command lines were included in the BFC source code 
package (http://bit.ly/bfc-eval). Notably, BLESS only works with 
uncompressed files. The rest of tools were provided with gzip’ d 
files as input. We have also tried AllPaths-LGjGneirejd all 1201 ll) . 
Fiona dSchulz et all |2014|) and Trowel dLim et alll2014l). but t hey 
requi re more RAM than our machine. QuorUM-1.0.0 dZimin et all 
l2013l) always trims reads, making it hard to be compared to others 
which keep full-length reads. 

As is shown in the table, BBMap is the fastest. BFC, BLESS, 
Bloocoo and Lighter are comparable in speed. Bloocoo is the most 
lightweight. Other bloom filter based tools, BFC-bf, BLESS and 
Lighter, also have a small memory footprint. Most evaluated tools 
have broadly similar accuracy. BFC-ht is more accurate than BFC- 
bf overall, suggesting retaining high-quality fc-mers helps error 
correction; both BFC implementations are marginally better in 
this evaluaton, correcting more reads with fewer or comparable 
overcorrections when a similar fc-mer length is in use, which 
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